Advanced JSON Techniques

The basics of using JSON within DB2 were covered in a previous notebook. This notebook will cover some of the advanced techniques that are required to access arrays and structures that may be nested within a JSON document. The first step (as with any notebook!) is to load the Db2 Jupyter magic commands.


In [ ]:
%run db2.ipynb

This section will work with a larger table that will give us some idea of the performance of using JSON within a Db2 database. This next statement will generate 25,000 customer records that have a structure similar to this:

{
    "customerid": 100000,
    "identity": {
        "firstname": "Jacob",
        "lastname": "Hines",
        "birthdate": "1982-09-18"
    },
    "contact": {
        "street": "Main Street North",
        "city": "Amherst",
        "state": "OH",
        "zipcode": "44001",
        "email": "Ja.Hines@yahii.com",
        "phone": "813-689-8309"
    },
    "payment": {
        "card_type": "MCCD",
        "card_no": "4742-3005-2829-9227"
    },
    "purchases": [
        {
            "tx_date": "2018-02-14",
            "tx_no": 157972,
            "product_id": 1860,
            "product": "Ugliest Snow Blower",
            "quantity": 1,
            "item_cost": 51.86
        },...
    ]
}

The JSON record contains five distinct pieces of information:

  • Customerid - Primary key
  • Identity - Information on the customer including name and birthdate
  • Contact - Address, email, and phone number information
  • Payment - Current payment card that is used
  • Purchases - The purchases that the customer has made

The purchases structure contains information on the customer's purchases. For each purchased item, the following information is recorded:

  • tx_date - Date of the transaction
  • tx_no - Transaction number
  • product_id - Id for the product
  • product - Name of the product
  • quantity - Quantity of products purchased
  • item_cost - Cost of one product
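The fields above are enough to compute the value of an order: for each array element, multiply quantity by item_cost and add up the results. A short plain-Python sketch, using a hypothetical record shaped like the sample document above:

```python
import json

# Hypothetical customer document (only the fields needed here)
record = json.loads("""
{
  "customerid": 100000,
  "purchases": [
    {"tx_no": 157972, "product": "Ugliest Snow Blower",
     "quantity": 1, "item_cost": 51.86},
    {"tx_no": 157973, "product": "Giant Garden Gnome",
     "quantity": 3, "item_cost": 19.99}
  ]
}
""")

# Total spend is the sum of quantity * item_cost across the purchases array
total = sum(p["quantity"] * p["item_cost"] for p in record["purchases"])
print(round(total, 2))
```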

If this were a relational database, you would probably split these fields up into separate tables and use join techniques to bring the information back together. In a JSON document we can keep all of this information in one record, which makes retrieving an individual customer's purchases easier.

Connect to the Db2 Database

You may need to add connection information here if you have not yet connected to the database.


In [ ]:
%sql connect

We need to create JSON records for loading into the table. This next command will run the generate_json.ipynb file, which contains code to generate JSON records. If you want to change the number of records created or the data used, edit the file to modify the defaults. The program will display its progress as it runs.


In [ ]:
%run generate_json.ipynb

Load into Customer Table (BSON)

The table that we create for JSON data will include a column with the actual JSON record and another field as an identifier for the row. The load statement in the next section of code will insert the customerid from the JSON record into the CUSTNO column.


In [ ]:
%%sql -q
  DROP TABLE CUSTOMERS_BSON;
  CREATE TABLE CUSTOMERS_BSON 
    (
    CUSTNO  INT NOT NULL,
    DETAILS BLOB(2048) INLINE LENGTH 2048
    );

The next piece of code will import the data into Db2. The Python code will read the customers.js file, parse it, and then insert it into the CUSTOMERS_BSON table. Note that the entire JSON record is stored into the DETAILS column, while the CUSTNO column is set to the customerid field that is found in the JSON record.


In [ ]:
import io
import json
import time
print("Starting Load")
start_time = time.time()
%sql autocommit off
x = %sql prepare INSERT INTO CUSTOMERS_BSON VALUES (?,SYSTOOLS.JSON2BSON(?))
if (x != False):
    i = 0
    with open("customers.js","r") as records:
        for record in records:
            rec = json.loads(record)
            custno = rec['customerid']
            i += 1
            rc = %sql execute :x using :custno, :record
            if (rc == False): break
            if ((i % 5000) == 0): 
                print(str(i)+" rows read.")
                %sql commit hold
                
    %sql commit work  
%sql autocommit on
end_time = time.time()
print('Total load time for {:d} records is {:.2f} seconds'.format(i,end_time-start_time))
customer_bson = end_time - start_time

To double-check the JSON format we retrieve one record from the table.


In [ ]:
%sql -j select systools.bson2json(details) from customers_bson fetch first 1 row only

The table does not have an index currently defined on it. We could have created the table with a primary key but chose not to in order to speed up the load step. The next statement will create the index for our table.


In [ ]:
%sql CREATE UNIQUE INDEX IX_CUST_BSON ON CUSTOMERS_BSON(CUSTNO)

Db2 JSON Macro Extensions

The Db2 notebook on JSON features described how to retrieve individual fields from a record by using the JSON_VAL function. The following code retrieves the first and last name of customer number 100000.


In [ ]:
%%sql 
SELECT JSON_VAL(DETAILS,'identity.firstname','s:32') AS FIRSTNAME, 
       JSON_VAL(DETAILS,'identity.lastname','s:32') AS LASTNAME
FROM CUSTOMERS
WHERE CUSTNO = 100000

In order to make JSON retrieval easier in Jupyter notebooks, the Db2 Jupyter %sql command allows for the creation of macros. Macros are used to expand text in a SQL statement. Macros do not execute any code, but are used specifically to insert text into a SQL statement. An example of a macro is found below.


In [ ]:
%%sql macro helloworld
echo Everything you said is here: {1}
exit

In [ ]:
%sql %helloworld(Hello World)

A number of macros can be found in the db2json.ipynb file. Running the file in the next step will load these macros.


In [ ]:
%run db2json.ipynb

To use a macro, we precede its name with the % character and enclose any parameters in parentheses:

SELECT %js(details,identity.firstname,32) FROM CUSTOMERS WHERE CUSTNO = 100000

The %js macro expands into the corresponding JSON_VAL(details, ...) call in the SQL. To echo the contents of what the macro produces, include the -e flag in the SQL statement.


In [ ]:
%%sql -e
SELECT %js(details,identity.firstname,32)
FROM CUSTOMERS
WHERE CUSTNO = 100000

You will notice that a green box is placed around the generated code and that the SQL has been modified by the macro to include the full JSON_VAL function. This makes the SQL easier to create (and hopefully less error prone) when dealing with JSON objects. Five macros have been predefined for JSON queries; four are listed below, and the fifth (%jsonarray) is described later in this section.

  • -e - This flag turns on the display of the generated SQL. This is useful if you want to cut and paste the final SQL into an application. If you do not use the -e flag, the generated code will not be displayed.
  • %js - Return a string from a JSON field
  • %ji - Return an integer from a JSON field
  • %jd - Return a number (decimal) from a JSON field
  • %jdate - Return a date from a JSON field

The format for the %ji (integer) and %jdate (date) macros is:

%ji(json_column,json_field)
%jdate(json_column,json_field)

Note that you do not need to enclose the column or field names in quotes unless they contain special characters. The quotes are stripped from the parameters, but the macros will add them back into the SQL function where necessary. Also, field names are case sensitive in JSON, so make sure you use the proper spelling for the field you are trying to retrieve. The column name that contains the JSON data is not case sensitive.

The %jd macro works like the %ji macro except that it takes an additional decimal formatting parameter:

%jd(json_column, json_field, 'digits,decimals')

A decimal number is formatted as x,y where x is the total number of digits and y is the number of digits after the decimal point. So a value of 5,2 means there are 5 digits in total, with 2 of them after the decimal point.
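For example, a '5,2' specification can hold values up to 999.99 (three digits before the decimal point, two after). A quick plain-Python sanity check of that arithmetic (an illustration only, not Db2 code):

```python
def max_decimal(digits, decimals):
    """Largest value a 'digits,decimals' specification can hold:
    (digits - decimals) digits before the point, decimals digits after."""
    return (10 ** digits - 1) / (10 ** decimals)

print(max_decimal(5, 2))  # -> 999.99
print(max_decimal(7, 2))  # -> 99999.99  (the '7,2' used for item_cost below)
```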

The %js (JSON string) macro has the following format:

%js(column,field,length)

The macro needs to know which JSON column to use, the field that is being searched for, and the length of the string to return. Note that you do not use the JSON_VAL specification here ('s:255'); instead you supply the length of the return string as an integer.

The next SQL command illustrates the use of these macros by returning the customer's first name, last name, address, zip code, birthdate, and phone number from the JSON record.


In [ ]:
%%sql -e
SELECT
  %js(details,identity.firstname,12) AS FIRSTNAME,
  %js(details,identity.lastname,15) AS LASTNAME,
  %js(details,contact.street,30) AS STREET,
  %js(details,contact.city,20) AS CITY,
  %js(details,contact.state,2) AS STATE,
  %ji(details,contact.zipcode) AS ZIPCODE,
  %jdate(details,identity.birthdate) AS BIRTHDATE,
  %js(details,contact.phone,12) AS PHONE
FROM CUSTOMERS
WHERE CUSTNO = 100000

Using the -e flag is useful when debugging any SQL that uses JSON functions. In addition, it gives you the raw SQL that is generated so that you can cut and paste it into your own application. The examples with strings use the following format:

TRIM(JSON_VAL(...'s:12'))

If you use a specification that is too small to retrieve the field (s:12), then JSON_VAL will return a NULL value rather than a truncated string. For this reason, you may want to use a larger size; the value will then be trimmed down by the TRIM function.
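This NULL-on-overflow rule can be mimicked in plain Python. The helper below is an illustration of the behavior described above, not Db2's implementation:

```python
def json_val_str(value, length):
    """Illustrative stand-in for JSON_VAL's 's:n' rule: if the string does
    not fit in n characters the result is None (NULL), never a truncation."""
    return value if len(value) <= length else None

print(json_val_str("Jacob", 12))  # fits, so it is returned as-is
print(json_val_str("Jacob", 3))   # too small: None, not "Jac"
```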

The final macro in the list is %jsonarray. The macro call has the following format:

%jsonarray(table_name, pk, json_column, array, [optional where clause])

The fields are:

  • table_name - table that we are accessing
  • pk - primary key of the table
  • json_column - column that contains the json data
  • array - the field that contains an array of items (which must be a structure and not individual values)
  • where clause - An optional expression to limit the result set

The %jsonarray macro is used when you want to retrieve the contents of an array within a JSON record. The macro creates a WITH clause specification where the resulting table is called JSONARRAY and has two columns: PK for the primary key of the record containing the array, and ITEM for the array items returned. The following example shows how you would retrieve all of the purchases that a customer made using the JSON_TABLE function.
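In relational terms, the macro flattens each customer's purchases array into one (PK, ITEM) row per array element. A rough plain-Python equivalent of that flattening, using hypothetical in-memory records:

```python
# Two hypothetical customer documents (only the fields needed here)
docs = [
    {"customerid": 100000, "purchases": [{"product_id": 1860, "quantity": 1},
                                         {"product_id": 1753, "quantity": 2}]},
    {"customerid": 100001, "purchases": [{"product_id": 1860, "quantity": 5}]},
]

# Emulate the JSONARRAY(pk, item) result: one row per array element,
# keyed by the owning customer's primary key
jsonarray = [(doc["customerid"], item)
             for doc in docs
             for item in doc["purchases"]]

for pk, item in jsonarray:
    print(pk, item["product_id"], item["quantity"])
```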


In [ ]:
%%sql 
with purchases(items) as 
   (select systools.json2bson(items.value)
    from customers, 
    table( systools.json_table(customers.details,'purchases','s:2048')) as items
    where custno = 100000
   )
select %ji(items,product_id),%js(items,product,32),%ji(items,quantity),%jd(items,item_cost,'7,2') from purchases

This code can be simplified by using the %jsonarray macro instead.


In [ ]:
%%sql -e
WITH
  %jsonarray(customers,custno,details,purchases,where="custno = 100000")
SELECT 
  %ji(item,product_id),%js(item,product,32),%ji(item,quantity),%jd(item,item_cost,'7,2') from jsonarray

If you display the SQL that was created, you will see the temporary table specification jsonarray(pk,item) that was generated by the macro. You must start the SQL statement with the WITH clause, otherwise the syntax will be incorrect.


Sales by State

Now that we can access the data using the JSON macros, we can start doing some analysis on the records. For each of these queries, remember that there are no indexes being used to determine the answer. This first query will show the total sales per state. In order to determine this, we must take the array of purchases by a customer, calculate the total cost of their orders and sum it up for the state.


In [ ]:
%%sql -pb
WITH 
  %jsonarray(customers,custno,details,purchases), 
SALES(state, amount) AS
   ( 
   SELECT %js(c.details,contact.state,2), sum(%ji(j.item,quantity) * %jd(j.item,item_cost,'7,2')) 
   from jsonarray j, customers c 
   where j.pk = c.custno 
   GROUP BY %js(c.details,contact.state,2) 
   ) 
SELECT state, sum(amount) from SALES 
GROUP BY state 
ORDER BY 2 DESC

Orders of a Product

This calculation gives us the number of times a product was ordered. It doesn't tell us the quantity sold, since we need to look at the individual purchases to see how many units the customer actually ordered. The SQL command uses the -a flag so that the output is displayed in a table, which makes it easier to scroll through the results.


In [ ]:
%%sql -a
WITH 
  %jsonarray(customers,custno,details,purchases), 
  SALES(product, quantity) AS ( 
      SELECT %js(j.item,product,32),count(*) 
      from jsonarray j 
      GROUP BY %js(j.item,product,32) 
      ) 
SELECT product, quantity from SALES 
ORDER BY 2 DESC

Quantity of Products Ordered

This SQL tells us the actual quantity of each product ordered.


In [ ]:
%%sql -a
WITH 
  %jsonarray(customers,custno,details,purchases), 
  SALES(product, quantity) AS ( 
      SELECT %js(j.item,product,32), %ji(j.item,quantity) 
      from jsonarray j 
      ) 
SELECT product, sum(quantity) from SALES 
GROUP BY product
ORDER BY 2 DESC
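The difference between the two queries above (counting order lines versus summing quantities) can be seen with a tiny plain-Python example using hypothetical purchase lines:

```python
from collections import Counter

# Hypothetical purchase lines: (product, quantity)
lines = [("Snow Blower", 1), ("Snow Blower", 4), ("Garden Gnome", 2)]

# "Orders of a Product": count how many order lines mention each product
order_count = Counter(p for p, _ in lines)

# "Quantity of Products Ordered": sum the quantities instead
qty_total = Counter()
for product, qty in lines:
    qty_total[product] += qty

print(order_count["Snow Blower"])  # 2 order lines
print(qty_total["Snow Blower"])    # 5 units in total
```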

Top Product Sold in Each State

Here we need to get the product counts by state, and then pick the maximum quantity found in the answer set. The steps we need to perform in the SQL are:

  • Get all products purchased
  • Sum the sales of products by state
  • Compute the max for each state

The next set of SQL statements will get us the results. Note that it is possible for more than one product to be the top seller in a state. Note: this SQL may take a while to execute.
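The three steps map directly onto the WITH clauses in the SQL below. A plain-Python sketch with hypothetical (state, product, quantity) rows shows why a tie produces two rows for one state:

```python
from collections import defaultdict

# Hypothetical output of the SALES step: (state, product, quantity)
sales = [("OH", "Snow Blower", 10), ("OH", "Garden Gnome", 10),
         ("NY", "Snow Blower", 7), ("NY", "Garden Gnome", 3)]

# The MAXSALES step: maximum quantity per state
maxsales = defaultdict(int)
for state, _, qty in sales:
    maxsales[state] = max(maxsales[state], qty)

# The final join: every product that matches its state's maximum (ties allowed)
top = [(state, product, qty) for state, product, qty in sales
       if qty == maxsales[state]]
print(top)  # OH has two top sellers because of the tie
```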


In [ ]:
%%sql -a
WITH 
  %jsonarray(customers,custno,details,purchases), 
  SALES(product, state, quantity) AS ( 
      SELECT %js(j.item,product,32), %js(c.details,contact.state,2), SUM(%ji(j.item,quantity))
      from jsonarray j, customers c
      where j.pk = c.custno
      GROUP BY %js(j.item,product,32), %js(c.details,contact.state,2)
      ),
  MAXSALES(state, total) AS (
      SELECT STATE, MAX(quantity)
      FROM SALES
      GROUP BY STATE
  )
SELECT s.state, s.product, m.total 
  FROM SALES s, MAXSALES m
WHERE
  s.state = m.state and
  s.quantity = m.total
ORDER BY s.product desc

Converting JSON into Relational Tables

As you can see from the previous examples, it is possible to run analytics against the JSON data. However, it requires some sophisticated techniques for manipulating (and converting) the JSON data. One approach is to use the JSON functions to move the data into a traditional relational table for analysis. The following SQL will create a "flat" relational table that allows for further analysis without having to do joins. In addition, we are going to create the table using column organization, which gives us the ability to run queries without having to worry about which indexes are needed.


In [ ]:
%%sql
DROP TABLE TRANSACTIONS;
CREATE TABLE TRANSACTIONS
  (
  TX_DATE DATE,
  CITY VARCHAR(32),
  STATE CHAR(2),
  ZIPCODE INT,
  PRODUCT_NAME VARCHAR(32),
  PRODUCT_QTY INT,
  PRODUCT_COST DEC(7,2),
  CCARD CHAR(4)
  ) 
ORGANIZE BY COLUMN;

Next we use the %jsonarray macro to generate the rows we need to populate the TRANSACTIONS table.


In [ ]:
%%sql
INSERT INTO TRANSACTIONS 
WITH
  %jsonarray(customers,custno,details,purchases), 
  TXS(tx_date, city, state, zipcode, product_name, product_qty, product_cost, ccard) AS 
     ( 
      SELECT %jdate(j.item,tx_date), 
             %js(c.details,contact.city,32),
             %js(c.details,contact.state,2), 
             %ji(c.details,contact.zipcode),
             %js(j.item,product,32),
             %ji(j.item,quantity),
             %jd(j.item,item_cost,'7,2'),
             %js(c.details,payment.card_type,4)
      from jsonarray j, customers c
      where j.pk = c.custno
     )
  SELECT TX_DATE, CITY, STATE, ZIPCODE, PRODUCT_NAME, PRODUCT_QTY, PRODUCT_COST, CCARD
    FROM TXS

Next we can check our record count to see how many records we have.


In [ ]:
%sql SELECT COUNT(*) FROM TRANSACTIONS

Plot Routine

This code will produce a graph of the results from an SQL statement. The routine allows more flexibility in what is displayed on the chart, including changing the colors, the titles, and the axis names.

The format of the Plot is:

plotSQL(sql, "Title of the plot", "X Axis Title", "Y Axis Title", rotation, color palette)

If rotation is set to True, the x-axis labels are rotated by 30 degrees, placing the text at a slight angle to make longer labels easier to read. The color palette is one of m1, m2, m3, or m4.


In [ ]:
import matplotlib.pyplot as plt

m1 = ('#009926','#00994d','#009973','#009999','#007399','#004d99','#002699','#000099')
m2 = ("#00ff40","#00ff80","#00ffbf","#00ffff","#00bfff","#0080ff","#0040ff")
m3 = ("#9933ff","#cc33ff","#ff33ff","#ff33cc","#ff3399","#ff3366","#ff3333")
m4 = ("#ff00bf","#ff0080","#ff0040","#ff0000","#ff0000","#ff4000","#ff8000","#ffbf00","#ffff00")

def plotSQL(sql, title, xaxis, yaxis, rotation, colormap):
    if (rotation == True):
        rot = 30
    else:
        rot = 0
    df = %sql {sql}
    if (df is None):
        print("SQL call failed")
        return
    xvalues = df.columns.values[0]
    yvalues = df.columns.values[1]
    df.plot(kind='bar',x=xvalues,y=yvalues,color=colormap);
    _ = plt.xlabel(xaxis, fontsize=12);
    _ = plt.ylabel(yaxis, fontsize=12);
    _ = plt.suptitle(title, fontsize=20);
    _ = plt.xticks(rotation=rot);
    _ = plt.plot();

Columnar: Top 10 States by Sales


In [ ]:
sqlin = \
'''
WITH 
SALES(state, amount) AS
   ( 
   SELECT STATE, SUM(PRODUCT_QTY * PRODUCT_COST) 
   from TRANSACTIONS 
   GROUP BY STATE
   ) 
SELECT state, sum(amount) from SALES 
GROUP BY state 
ORDER BY 2 DESC FETCH FIRST 10 ROWS ONLY
WITH UR
'''
plotSQL(sqlin, 
    "Sales per State", 
    "State",
    "Revenue",
    False,
    m1)

Columnar: Top 10 Products Sold


In [ ]:
sqlin = \
'''
WITH 
  SALES(product, quantity) AS ( 
      SELECT PRODUCT_NAME, PRODUCT_QTY 
      from TRANSACTIONS
      ) 
SELECT product, sum(quantity) from SALES 
GROUP BY product
ORDER BY 2 DESC
FETCH FIRST 10 ROWS ONLY
WITH UR
'''
plotSQL(sqlin, 
    "Products Sold", 
    "Product",
    "Quantity",
    True,
    m2)

Weekly Sales with Moving Average


In [ ]:
sqlin = \
"""
WITH TOTALPROD(TX_DATE, TX_TOTAL) AS 
  (
  SELECT WEEK(TX_DATE), SUM(PRODUCT_QTY*PRODUCT_COST) FROM TRANSACTIONS
      WHERE YEAR(TX_DATE) = 2017 AND WEEK(TX_DATE) <= 52
  GROUP BY WEEK(TX_DATE)
  )
SELECT TX_DATE, TX_TOTAL AS SALES, 
     AVG(TX_TOTAL) OVER (
       ORDER BY TX_DATE
     ROWS BETWEEN 8 PRECEDING AND CURRENT ROW) AS MOVING_AVG
  FROM TOTALPROD
  ORDER BY TX_DATE
"""
df = %sql {sqlin}
txdate = df['TX_DATE']
sales = df['SALES']
avg = df['MOVING_AVG']

plt.xlabel("Week of Year", fontsize=12);
plt.ylabel("Sales", fontsize=12);
plt.suptitle("Weekly Sales and Moving Average", fontsize=20);
plt.plot(txdate, sales, 'r');
plt.plot(txdate, avg, 'b');
plt.show();
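The window function in the SQL above computes a trailing average over the current week and the 8 preceding weeks. The same calculation can be sketched in plain Python (hypothetical weekly totals, shown here with a 2-week lookback so the growing window is easy to see):

```python
def moving_avg(values, preceding=8):
    """Trailing moving average over the current row and up to `preceding`
    prior rows, matching ROWS BETWEEN n PRECEDING AND CURRENT ROW."""
    result = []
    for i in range(len(values)):
        window = values[max(0, i - preceding):i + 1]  # grows until full
        result.append(sum(window) / len(window))
    return result

weekly = [100, 200, 150, 300]
print(moving_avg(weekly, preceding=2))
```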

Top Products Sales by State


In [ ]:
import seaborn as sns
cm = sns.light_palette("green", as_cmap=True)
sqlin = \
"""
WITH 
  SALES(product, state, quantity) AS ( 
      SELECT PRODUCT_NAME, STATE, SUM(PRODUCT_QTY)
      from TRANSACTIONS
      GROUP BY PRODUCT_NAME, STATE
      ),
  MAXSALES(state, total) AS (
      SELECT STATE, MAX(quantity)
      FROM SALES
      GROUP BY STATE
  )
SELECT s.state, s.product, m.total 
  FROM SALES s, MAXSALES m
WHERE
  s.state = m.state and
  s.quantity = m.total
ORDER BY m.total desc
"""
result = %sql {sqlin}
result.style.background_gradient(cmap=cm)

Credits: IBM 2018, George Baklarz [baklarz@ca.ibm.com]